BooleanQuery rewrite for must_not RangeQuery clauses #17655
Signed-off-by: Peter Alfonsi <[email protected]>
❌ Gradle check result for d9eee10: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
@peteralfonsi -- can you write + run some tests when docs have more or fewer than 1 value for the field being queried? I think this optimization makes sense when each doc has exactly 1 value for the field, but I think we'll run into problems in the following cases:
The good news is that (at the shard level), we can cheaply check for the "exactly one value" case. There's Lucene code that does something similar here.
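A minimal sketch of what such a cheap check could look like, written in plain Java so it runs without Lucene on the classpath. The `SegmentFieldStats` class and its field names are hypothetical; they stand in for per-segment numbers that Lucene exposes through `PointValues#size()`, `PointValues#getDocCount()` and `LeafReader#maxDoc()`. This is an illustration of the idea, not the code the PR ended up with.

```java
import java.util.List;

/** Hypothetical holder for per-segment statistics of one field (illustration only). */
final class SegmentFieldStats {
    final long valueCount; // total indexed values for the field, cf. PointValues#size()
    final int docCount;    // docs with at least one value, cf. PointValues#getDocCount()
    final int maxDoc;      // docs in the segment, cf. LeafReader#maxDoc()

    SegmentFieldStats(long valueCount, int docCount, int maxDoc) {
        this.valueCount = valueCount;
        this.docCount = docCount;
        this.maxDoc = maxDoc;
    }

    /** True when no doc is missing the field and no doc has more than one value. */
    boolean everyDocHasExactlyOneValue() {
        return docCount == maxDoc && valueCount == docCount;
    }
}

final class SingleValueCheck {
    /** The rewrite is only considered safe if every segment of the shard passes. */
    static boolean allSegmentsSingleValued(List<SegmentFieldStats> segments) {
        for (SegmentFieldStats stats : segments) {
            if (!stats.everyDocHasExactlyOneValue()) {
                return false;
            }
        }
        return true; // vacuously true for a shard with no segments
    }
}
```

The check only compares three counters per segment, so it costs far less than inspecting individual documents.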
@msfroh This is a good point I didn't consider. For the missing values case, we can also add a third should clause which is just

The >1 value case is trickier. Would it be crazy for OpenSearch itself to keep track of which fields have documents with >1 value, on a per-shard basis? This could be updated at indexing/refresh time maybe. I think it could be worthwhile for more than just this rewrite. It would at least apply to 2 of the other BooleanQuery rewrites I'm planning.

Alternatively, are there any other ways to check this that already exist?
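To make the two problem cases concrete, here is a small self-contained sketch that models a document as an array of field values and compares the original must_not semantics with the rewritten complement semantics. The class and method names are hypothetical and the range is assumed inclusive on both ends; it models the matching logic only, not Lucene or OpenSearch internals.

```java
import java.util.Arrays;

final class MustNotRangeSemantics {
    /** Original query: the doc matches when none of its values fall inside [lo, hi]. */
    static boolean matchesMustNotRange(long[] values, long lo, long hi) {
        for (long v : values) {
            if (v >= lo && v <= hi) {
                return false;
            }
        }
        return true;
    }

    /** Rewritten query: the doc matches when at least one value is below lo or above hi. */
    static boolean matchesComplementShould(long[] values, long lo, long hi) {
        for (long v : values) {
            if (v < lo || v > hi) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        long lo = 10, hi = 20;
        // one value outside, one value inside, no value, one outside plus one inside
        long[][] docs = { {5}, {15}, {}, {5, 15} };
        for (long[] doc : docs) {
            System.out.println(Arrays.toString(doc)
                + " original=" + matchesMustNotRange(doc, lo, hi)
                + " rewritten=" + matchesComplementShould(doc, lo, hi));
        }
    }
}
```

The two single-valued docs give the same answer under both forms. The doc with no value matches the original query but not the rewritten one, and the doc with values 5 and 15 is excluded by the original query but matched by the rewritten one.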
Sorry for the very long delay, was caught up with other things. Looks like the method described by @msfroh is actually doable at QueryBuilder rewrite time, since we have access to
❕ Gradle check result for fc49ddd: UNSTABLE Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.
This PR is stalled because it has been open for 30 days with no activity.
Sorry for the delay in getting back to this one. I accidentally lost track of it.
The additional changes look safer, thanks! I just had a small clean-up suggestion on getting the LeafReaderContexts.
server/src/main/java/org/opensearch/index/query/BoolQueryBuilder.java
❌ Gradle check result for feed2e8: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
❕ Gradle check result for c273159: UNSTABLE Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.
Hey @msfroh, any further comments on this one?
❌ Gradle check result for c4d29f2: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
Flaky test: #18302
Hey @msfroh, just bumping on this
Description
This PR automatically rewrites boolean queries which have a must_not RangeQuery clause to instead use a should clause of the complement of that range. This can be 2-30x faster depending on the query. See #17586 where this is described in more detail.
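As a rough illustration of what "the complement of that range" means, the sketch below builds the complement pieces of a numeric range, flipping the inclusivity of each bound. `LongRange` and `RangeComplement` are hypothetical names used only here; the PR itself works on OpenSearch query builders, which this plain-Java model does not attempt to reproduce.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical range model; a null bound means the range is open-ended on that side. */
final class LongRange {
    final Long from;
    final boolean includeFrom;
    final Long to;
    final boolean includeTo;

    LongRange(Long from, boolean includeFrom, Long to, boolean includeTo) {
        this.from = from;
        this.includeFrom = includeFrom;
        this.to = to;
        this.includeTo = includeTo;
    }

    boolean contains(long v) {
        boolean aboveLower = from == null || (includeFrom ? v >= from : v > from);
        boolean belowUpper = to == null || (includeTo ? v <= to : v < to);
        return aboveLower && belowUpper;
    }
}

final class RangeComplement {
    /** Returns up to two ranges that together cover every value outside r. */
    static List<LongRange> complement(LongRange r) {
        List<LongRange> pieces = new ArrayList<>();
        if (r.from != null) {
            // everything below the lower bound; inclusive only if the original bound was exclusive
            pieces.add(new LongRange(null, false, r.from, !r.includeFrom));
        }
        if (r.to != null) {
            // everything above the upper bound; inclusive only if the original bound was exclusive
            pieces.add(new LongRange(r.to, !r.includeTo, null, false));
        }
        return pieces;
    }
}
```

For example, the complement of [10, 20) is the pair of ranges (-inf, 10) and [20, +inf). Each piece would become one should clause in the rewritten query.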
Example original query (on nyc_taxis):
Rewritten query:
Some benchmark numbers from http_logs and nyc_taxis (excluded ranges are on the @timestamp and dropoff_datetime fields respectively). "Originally written as" means whether the query was sent to OpenSearch with a must_not clause, or if it was sent already rewritten with should clauses. Ideally, after the changes are applied, these p50s should be the same.

I believe the small differences between runs (for example, 7/1-9/1 should going from 427 -> 405 ms, when we'd expect no change) are just due to variation between different runs/instances. This is expected from what I've seen in tiered caching benchmarks. I've done a few runs and the direction/magnitude of the changes vary.

Related Issues
Part of #17586
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.